Abstract

Elucidating the content of a DNA sequence is critical to understanding and decoding
the genetic information of any biological system. As next generation sequencing (NGS)
techniques have become cheaper and gained throughput over time, they have produced
major innovations and breakthrough findings in various biological areas. Areas shaped
by these technological advances include the evolution of species, microbial mapping,
population genetics, genome-wide association studies (GWAS), comparative genomics,
variant analysis, gene expression, gene regulation, epigenetics and personalized
medicine. While NGS techniques are key players in modern biological research, the
analysis and interpretation of the vast amount of data they produce is not a trivial
task and remains a great challenge in the field of bioinformatics. Therefore, efficient
tools that cope with information overload, tackle the high complexity and provide
meaningful visualizations to make knowledge extraction easier are essential. In this
article, we briefly refer to the sequencing methodologies and the equipment available
for these analyses, and we describe the data formats of the files they produce. We
then present a thorough review of tools developed to efficiently store, analyze and
visualize such data, with emphasis on structural variation analysis and comparative
genomics. We finally comment on their functionality, strengths and weaknesses, and we
discuss how future applications could further develop in this field.

Keywords: SNPs, SNVs, CNV, Structural variation, Sequencing, Genome browser,
Visualization, Polymorphisms, Genome wide association studies



Introduction 

High-throughput next generation sequencing (NGS) techniques have revolutionized
biology and closely related fields and have shaped how modern biological research
can be done at a large scale. With the advances of these techniques, it is nowadays
feasible to sequence a whole genome or exome at base pair resolution with a low error
rate, in an acceptable time frame and at a lower cost.

Based on the original Sanger sequencing technique, the Human Genome Project
(1990-2003) allowed the release of the first human reference genome by determining
the sequence of ~3 billion base pairs and identifying the approximately 25,000 human
genes [1-3]. That stood as a great breakthrough in the field of comparative genomics
and genetics, as one could in theory directly compare any healthy or non-healthy
sample against a gold standard reference and detect genetic polymorphisms or variants
that occur in a genome. A few years later, as sequencing techniques became more
advanced, more accurate and less expensive, the 1000 Genomes Project [4] was launched
(January 2008). The main scope of this consortium is to sequence ~1000 anonymous
participants of different nationalities and compare these sequences to each other in
order to better understand human genetic variation. Recently, as a result of the
project, 1092 such human genomes were sequenced and published [5]. The International
HapMap Project (short for "haplotype map") [6-10] aims to identify common genetic
variations among people and is currently making use of data from six different
countries.

Shortly after the 1000 Genomes Project, the 1000 Plant Genome Project
(http://www.onekp.com) was launched, aiming to sequence and define the transcriptome
of ~1000 plant species from different populations around the world. Notably, out of
the 370,000 green plants that are known today, only ~125,000 species have recorded
gene entries in GenBank and many others still remain unclassified [11]. While the
1000 Plant Genome Project focused on comparing different plant species around the
world, the 1001 Genomes Project [12] sequenced 1000 whole genomes of A. thaliana
plants from different places of the planet.

Similar to other consortia, the 10,000 Genome Project [13] aims to create a
collection of tissue and DNA specimens for 10,000 vertebrate species specifically
designated for whole-genome sequencing. In addition, the overarching goal of the 1000
Fungal Genome Project (F1000 - http://1000.fungalgenomes.org) is to explore all areas
of fungal biology by providing broad genomic coverage of Kingdom Fungi. Notably,
sequencing advances have paved the way to metagenome sequencing, an approach for
studying the microbial populations of a community by analysing the nucleotide
sequence content of a sample. Moreover, NGS will allow for the accurate detection of
pan-genomes, which describe the full complement of genes across all the strains of a
species, a concept typically applied to bacteria and archaea [14].

In the near future, sequencing techniques are expected to become even less
time-consuming and more cost-effective, making it possible to screen whole genomes
within a few hours or even minutes. While sequencing techniques improve over time,
the amount of data produced increases exponentially, and therefore efficient
platforms to analyze and visualize such large amounts of data quickly have become a
necessity. Following a top-down approach, the current review starts with an overview
of generic visualization and analysis tools and file formats that can be used in any
next generation sequencing analysis. While such tools are broadly applicable, the
review progressively focuses on their application in structural variation detection
and representation, commenting in parallel on their strengths and weaknesses and
giving insights into how they could further develop to handle the overload of
information and cope with the data complexity. It is not the scope of this article to
describe the existing sequencing techniques in depth; readers are encouraged to
consult the more detailed descriptions of widely used sequencing technologies in
[15,16]. Thorough explanations of how hundreds of thousands or even millions of
sequences can be generated by such high-throughput techniques are presented in
[17,18], while sequence assembly strategies are extensively explained in [19]. The
advantages and the limitations of the aforementioned techniques are discussed in
[20,21].

Sequencing technologies 

First, second and third generation 

Sequencing techniques are chronologically divided into three generations: the first,
the second and the third. The key principle of the first generation (Sanger or
dideoxy) sequencing techniques, introduced in 1977, was the use of dideoxy nucleotide
triphosphates (ddNTPs) as DNA chain terminators so that the labeled fragments could
be separated by size using gel electrophoresis. Dye-terminator sequencing, introduced
in the late 90s, performs the labeling in a single reaction instead of four reactions
(A, T, C, G). In dye-terminator sequencing, each of the four ddNTPs is labeled with a
fluorescent dye that emits light at a different wavelength. Dye-terminator sequencing
combined with capillary electrophoresis sped up the process and became one of the
most standardized and widely used techniques.

Second generation high-throughput sequencing techniques generate thousands or
millions of short sequences (reads) at higher speed and better accuracy. Such
sequencing approaches can immediately be applied in relevant medical areas where
previous Sanger-based trials fell short in capturing the desired sequencing depth in
a manageable time-scale [22]. High-throughput second generation commercial
technologies have already been developed by Illumina [23,24], Roche 454 [25] and
Applied Biosystems/SOLiD. Today Illumina is the most widely used platform despite its
lower sample multiplexing capability [26]. Recent Illumina HiSeq systems make it
possible for researchers to perform large and complex sequencing studies at a lower
cost. Cutting-edge innovations can dramatically increase the number of reads, the
sequence output and the data generation rate. Thus, researchers are now able to
sequence more than five human genomes at ~30x coverage simultaneously, or ~100 exome
samples in a single run.

Helicos Biosciences (http://www.helicosbio.com/), Pacific Biosciences
(http://www.pacificbiosciences.com/), Oxford Nanopore (http://www.nanoporetech.com/)
and Complete Genomics (http://www.completegenomics.com/) belong to the third
generation of sequencing techniques, each of which has its pros and cons [16,27,28].
These techniques promise to sequence a human genome at a very low cost within a
matter of hours.

While first generation sequencing is today no longer used for large-scale projects
because of its prohibitive cost and time consumption, second generation sequencing
technologies are widely used due to their lower cost and time efficiency. Such
techniques have led to a plethora of applications, such as DNA-seq and assembly to
determine an unknown genome from scratch or to look for variations among genome
samples, RNA-seq [29,30] to analyse gene expression, and ChIP-seq [31] to identify
DNA regions that are binding sites for proteins such as transcription factors. It is
not the scope of this review to describe the aforementioned techniques in depth, but
we give a short description of DNA sequencing and assembly and explain below how it
can be used to discover structural variations.

DNA sequencing and assembly 

DNA sequencing can be applied to very long pieces of DNA such as whole chromosomes or
whole genomes, but also to targeted regions such as the exome or a selection of genes
pulled down from assays or in solution. There are two different scenarios under which
DNA sequencing is carried out. In the first case a reference genome for the organism
of interest already exists, whereas in the second case, de novo sequencing, no
reference sequence is available. The reference genome approach consists of three
general steps. Firstly, DNA molecules are broken down into smaller fragments at
random positions by using restriction enzymes or mechanical forces. Secondly, a
sequencing library consisting of such fragments of known insert size is created.
Thirdly, these fragments are sequenced and finally mapped back to an already known
reference sequence. The general methodology is widely known as shotgun sequencing and
is depicted in Figure 1. In the case of de novo sequencing, where there is no a
priori catalogued reference sequence for the given organism, the small sequenced
fragments are assembled into contigs (groups of overlapping, contiguous fragments)
and the consensus sequence is finally established from these contigs. This process is
often compared to putting together the pieces of a jigsaw puzzle: the short DNA
fragments produced are assembled computationally into one long and contiguous
sequence, and no prior knowledge about the original sequence is needed. While short
read technologies produce higher coverage, longer reads are easier to process
computationally and to interpret analytically, as they are faster to align than short
reads because they are more likely to align to unique locations on a genome. Notable
tools for sequence assembly are Celera [32], Atlas [33], Arachne [34], JAZZ [35],
PCAP [36], ABySS [37], Velvet [38] and Phusion [39]. The accuracy of this approach
increases when comparing larger fragments (resulting in larger overlaps) of less
repetitive DNA molecules. For larger genomes, this method has many limitations,
mainly due to the small size of the reads and its high cost. The de novo assembly
process is displayed in Figure 2.

Figure 1 DNA sequencing. 1st step: The DNA of interest is purified and extracted. 2nd
step: Creation of multiple copies of DNA. 3rd step: DNA is shattered into smaller
pieces. 4th step: DNA fragment sequencing. 5th step: A computer maps the small pieces
to an already known reference genome.

Figure 2 DNA assembly. 1st step: The DNA is purified and extracted. 2nd step: DNA is
fragmented into smaller pieces. 3rd step: DNA fragment sequencing. 4th step: A
computer matches the overlapping parts of the fragments to get a continuous sequence.
5th step: The whole sequence is reassembled. No prior knowledge about the DNA
sequence is necessary.
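
The jigsaw analogy for de novo assembly can be made concrete with a toy example. The
following is a minimal sketch of a greedy overlap merger, assuming error-free reads
from a single strand and a genome without repeats. The function names and the
`min_overlap` parameter are illustrative assumptions; none of the assemblers cited
above is claimed to work this way.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the two sequences that share the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical reads that tile a 13 bp sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

In this sketch, larger overlaps are preferred, which mirrors the remark above that
longer overlaps in less repetitive DNA make the reconstruction more reliable.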



The structural variome 

A single nucleotide polymorphism (SNP), or equally a single nucleotide variation
(SNV), refers to a single nucleotide change (adenine-A, thymine-T, guanine-G or
cytosine-C) in genomic DNA which is observed between members of the same biological
species or between paired chromosomes in a single individual. A SNP example is shown
in Figure 3. SNPs are single nucleotide substitutions, which are divided into two
types: transitions (interchanges of two purines or two pyrimidines, i.e. A-G or C-T)
and transversions (interchanges between a purine and a pyrimidine, i.e. A-T, A-C, G-T
and G-C). There are multiple public databases which store information about SNPs. The
National Center for Biotechnology Information (NCBI) has released dbSNP [40], a
public archive for genetic variation within and across different species. The Human
Gene Mutation Database (HGMD) [41] holds information about gene mutations associated
with human inherited diseases and functional SNPs. The International HapMap Project
(short for "haplotype map") [6-10] holds information about genetic variations among
people, so far containing data from six countries. The data include haplotypes
(several SNPs that cluster together on a chromosome), their locations in the genome
and their frequencies in different populations throughout the world. Other databases
worth mentioning are HGBASE [42], HGVbase [43], GWAS Central [44] and SNPedia [45].
A great variety of tools to detect SNVs and predict their impact is reviewed in
detail in [46].
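
The distinction between transitions and transversions follows directly from the
purine/pyrimidine classes of the two alleles. A minimal sketch, with a hypothetical
helper name chosen here for illustration:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single nucleotide substitution as transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("expected single nucleotides A, C, G or T")
    if ref == alt:
        raise ValueError("reference and alternative alleles are identical")
    # Both alleles in the same chemical class: transition; otherwise transversion.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for ref, alt in [("A", "G"), ("C", "T"), ("A", "T"), ("G", "C")]:
    print(ref, alt, classify_substitution(ref, alt))
```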

Recently, the focus has shifted to understanding genetic differences in the form of
short sequence fragments or structural rearrangements (rather than variation at the
single nucleotide level). This type of variation is known as the structural variome.
The structural variome refers to the set of structural genetic variations in
populations of a single species that have been acquired in a relatively short time on
an evolutionary scale. Structural variations are mainly separated into two
categories, namely balanced and unbalanced variations. The basic variations include
insertions, deletions, duplications, translocations and inversions. Balanced
variations refer to genome rearrangements that do not change the total content of the
DNA. These are mainly inversions or intra/inter-chromosomal translocations.
Unbalanced variations, on the other hand, refer to rearrangements that change the
total DNA content. These are insertions and deletions. Unbalanced variations are also
called copy number variations (CNVs). Figure 4 shows a schematic representation of
such intra/inter-chromosomal balanced and unbalanced structural variations.

Figure 3 SNP example. A difference in a single nucleotide between two DNA fragments
from different individuals. In this case we say that there are two alleles: C and T.



Methods to detect structural variations 

During the past years, a great effort has been made towards the development of
several techniques [47] and software applications [46] to detect structural
variations in genomes. In the case of SNP detection, the differences are extracted
from local alignments, whereas for the detection of structural variations, approaches
such as read-pair (RP), read-depth (RD) and split-reads can be used.

Paired-end mapping (PEM)

According to this approach, the DNA is initially fragmented into smaller pieces. The
two ends of each DNA fragment (paired-end reads or mate pairs) are then sequenced and
finally mapped back to the reference sequence. Notably, the two end reads of each
fragment are long enough to allow for unique mapping back to the reference genome.
The idea behind this strategy is that the two ends, when aligned back to the
reference genome, map to positions separated by an expected distance that is known
from the library insert size. In certain cases, the mapped distance differs from the
expected length, or the mapping displays an orientation different from that
anticipated. These observations can be considered strong indicators of a possible
structural variation. Thus, if the mapped distance is larger than the expected one,
it could indicate a deletion in the sequenced sample, whereas a smaller distance
could indicate an insertion. The main difference between the terms paired-end reads
and mate pairs is that while paired-end reads provide tighter insert sizes, mate
pairs give the advantage of larger insert sizes [47]. Differences and structural
variations among genomes can be tracked by observing PEM signatures. While PEM
signatures together with approaches to detect them are described in detail elsewhere
[47], some common signatures are shown in Figure 5.

Figure 4 Structural Variations. This figure illustrates the basic structural
variations. A) Inversion. B) Translocation within the same chromosome. C)
Translocation across different chromosomes. D) Duplication. E) Deletion.
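
The basic PEM signatures can be illustrated with a small rule-based classifier. This
is a minimal sketch under stated assumptions: a library in which the two mates of a
concordant pair map to opposite strands, a single expected insert size with a fixed
tolerance, and one pair considered at a time. The `ReadPair` type and the labels are
hypothetical; the tools reviewed later cluster many discordant pairs before calling
a variant. In line with the usual PEM convention, a pair that maps farther apart on
the reference than expected is treated as evidence for a deletion in the sample.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int      # mapped position of the first mate on the reference
    strand1: str   # "+" or "-"
    chrom2: str
    pos2: int      # mapped position of the second mate
    strand2: str


def classify_pair(pair: ReadPair, expected_insert: int, tolerance: int) -> str:
    """Assign a single read pair to a basic PEM signature."""
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal translocation candidate"
    if pair.strand1 == pair.strand2:
        # Concordant mates are assumed to map to opposite strands.
        return "inversion candidate"
    span = abs(pair.pos2 - pair.pos1)
    if span > expected_insert + tolerance:
        return "deletion candidate"    # mates farther apart than expected
    if span < expected_insert - tolerance:
        return "insertion candidate"   # mates closer together than expected
    return "concordant"


pair = ReadPair("chr1", 1000, "+", "chr1", 2000, "-")
print(classify_pair(pair, expected_insert=300, tolerance=50))
```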

Single-end 

According to this methodology, multiple copies of a DNA molecule are produced and
randomly chopped into smaller fragments (reads). These reads are eventually aligned
and mapped back to a reference genome. The reasoning behind this approach is that
reads will map to various positions across the genome and overlap significantly with
each other. By counting how many reads cover each nucleotide, i.e. the depth of
coverage (DOC), it is possible to estimate the number of reads that have been mapped
to a specific genomic position (see Figure 6). The depth of coverage is an effective
way to detect insertions or deletions, gains or losses in a donor sample compared to
the reference genome. Thus, a region that has been deleted will have fewer reads
mapped to it, and vice versa in cases of insertions or gains. Similarly to PEM, the
aforementioned methodology is an alternative way to extract information about
possible structural variations, described by DOC signatures. While read-depth has a
higher resolution, it gives no information about the location of the variation and it
can only detect unbalanced variations. DOC signatures, compared to PEM signatures,
are more suitable for detecting larger events, since the larger the event, the
stronger the signal of the signature. On the other hand, PEM signatures are more
suitable for detecting smaller events, even with low coverage, but are far less
efficient in localizing breakpoints. Available tools to detect structural variations,
clustered according to the different methodologies, are presented below.

Figure 5 PEM signatures. Basic PEM signatures. A) Insertion. B) Deletion. C)
Inversion. More PEM signatures are visually presented in [47].

Figure 6 Read depth. A) Fragments of DNA (reads) are mapped to the original reference
genome. B) Plotting the frequency of each nucleotide that was mapped at the reference
genome.
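
The depth-of-coverage idea can be sketched in a few lines. The example below is an
illustration only: it assumes ungapped read alignments given as (start, length)
pairs, and it uses fixed `low` and `high` thresholds chosen by the caller, whereas
real read-depth callers normalize coverage and apply statistical models. All names
are hypothetical.

```python
def depth_of_coverage(reads, genome_length):
    """Count how many reads cover each reference position.

    `reads` is an iterable of (start, length) pairs with 0-based starts.
    """
    depth = [0] * genome_length
    for start, length in reads:
        for pos in range(max(start, 0), min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def flag_regions(depth, low, high):
    """Return (start, end, label) runs with depth < low ("loss") or > high ("gain").

    `end` is exclusive.
    """
    def label(value):
        if value < low:
            return "loss"
        if value > high:
            return "gain"
        return None

    regions = []
    run_start, run_label = 0, None
    for pos in range(len(depth) + 1):
        # A virtual position past the end closes any run that is still open.
        current = label(depth[pos]) if pos < len(depth) else None
        if current != run_label:
            if run_label is not None:
                regions.append((run_start, pos, run_label))
            run_start, run_label = pos, current
    return regions


depth = depth_of_coverage([(0, 4), (2, 4), (8, 4)], genome_length=12)
print(depth)
print(flag_regions(depth, low=1, high=1))
```

Because the signal is a per-position count, the sketch can only report gains and
losses, which is consistent with the statement above that read depth detects
unbalanced variations only.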

Split-reads 

According to this approach, a read is mapped to two separate locations because of a
possible structural variation: the alignments of the prefix and the suffix of the
read are separated by a longer gap. This split-read mapping strategy is useful for
detecting small to medium-sized rearrangements at base pair resolution. It is also
suitable for mRNA sequencing, where the absence of introns gives rise to junction
reads that span exon-exon boundaries. Often, local assembly is used to detect regions
of micro-homology or non-template sequences around a breakpoint and thereby determine
the actual sequence around it.
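
A toy version of split-read mapping for a simple deletion is sketched below. It
assumes exact matches, a forward-strand read and a reference without repeats, and it
uses plain string search instead of a real aligner; the function name and the
`min_anchor` parameter are hypothetical. The sequences are made up for illustration.

```python
def split_read_deletion(reference: str, read: str, min_anchor: int = 5):
    """Look for a read whose prefix and suffix map on either side of a gap.

    Returns the deleted reference interval as (start, end), end exclusive,
    or None if no split with two anchors of at least `min_anchor` bases exists.
    """
    # Try the longest possible prefix anchor first, then shorten it.
    for cut in range(len(read) - min_anchor, min_anchor - 1, -1):
        prefix, suffix = read[:cut], read[cut:]
        prefix_pos = reference.find(prefix)
        if prefix_pos == -1:
            continue
        prefix_end = prefix_pos + cut
        suffix_pos = reference.find(suffix, prefix_end)
        if suffix_pos > prefix_end:
            # The suffix maps downstream of the prefix with a gap in between.
            return (prefix_end, suffix_pos)
    return None


reference = "GATTACAGGCTTCCAAGTCGATCGTAGC"
sample = reference[:10] + reference[16:]   # sample lacking reference[10:16]
read = sample[3:17]                        # read spanning the breakpoint
print(split_read_deletion(reference, read))
```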



File formats 

Sequencing techniques generate vast amounts of data that need to be efficiently
stored, parsed and analyzed. A typical sequencing experiment might produce files
ranging from a few gigabytes to terabytes in size, containing thousands or millions
of reads together with additional information such as read identifiers, descriptions,
annotations and other meta-data. Therefore, file formats such as FASTQ [48], SAM/BAM
[49] or VCF [50] have been introduced to efficiently store such information.

FASTQ

FASTQ is a simple extension of the FASTA format and is widely used in DNA sequencing
mainly due to its simplicity. Its main strength is its ability to store a numeric
quality score (PHRED [51]) for every nucleotide in a sequence. A FASTQ record
consists of four lines. The first line starts with the symbol '@', which is followed
by the sequence identifier. The second line contains the whole sequence as a series
of nucleotides in uppercase. Tabs or spaces are not permitted. The third line starts
with the symbol '+', which indicates the end of the sequence and the start of the
quality string that follows in the fourth line. Often, the third line repeats the
identifier of the first line. The quality string in the fourth line uses a subset of
the printable ASCII characters. Each character of the quality string corresponds to
one nucleotide of the sequence; thus the two strings should have the same length.
Encoding quality scores as ASCII characters makes the FASTQ format easier to edit.
The range of printable ASCII characters used to represent quality scores varies
between different technologies. The Sanger format accepts a PHRED quality score from
0 to 93 using ASCII 33 to 126. Illumina 1.0 encodes an Illumina quality score from -5
to 62 using ASCII 59 to 126. The Illumina 1.3+ format can encode a PHRED quality
score from 0 to 62 using ASCII 64 to 126. Using different ranges for every technology
is often confusing, and therefore the Sanger version of the FASTQ format has found
the broadest acceptance. Quality scores and how they are calculated per platform are
described in [52]. A typical FASTQ file is shown in Figure 7. Compression algorithms
such as [53] and [54] succeed in storing FASTQ files using less disk space. In order
to interconvert files between the Sanger and Illumina 1.3+ variants, Biopython [55],
EMBOSS, BioPerl [56] and BioRuby [57] come with file conversion modules.
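
The four-line record layout and the ASCII offsets described above can be sketched as
follows. This is an illustration rather than a replacement for the conversion
modules of the libraries just mentioned: the function names are hypothetical, only
the two PHRED-based variants (offset 33 for Sanger, offset 64 for Illumina 1.3+) are
handled, and the record used in the example is invented. The last helper applies the
standard PHRED relation Q = -10 log10(P) between a score and the error probability.

```python
OFFSETS = {"sanger": 33, "illumina1.3+": 64}


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (identifier, sequence, quality)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], sequence, quality


def decode_quality(quality, encoding="sanger"):
    """Turn a quality string into a list of PHRED scores."""
    offset = OFFSETS[encoding]
    return [ord(char) - offset for char in quality]


def convert_quality(quality, source="illumina1.3+", target="sanger"):
    """Re-encode a quality string between the two PHRED-based variants."""
    shift = OFFSETS[target] - OFFSETS[source]
    return "".join(chr(ord(char) + shift) for char in quality)


def error_probability(phred_score):
    """Probability that a base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-phred_score / 10)


name, seq, qual = parse_fastq_record(["@read1", "ACGT", "+", "IIII"])
print(name, seq, decode_quality(qual, "sanger"))
```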

Sequence alignment/Map (SAM) format 

SAM describes a flexible and generic way to store information about alignments
against a reference sequence. It supports both short and long reads produced by
different sequencing platforms. It is compact in size, efficient in random access and
is the format that was mostly used by the 1000 Genomes Project to release alignments.
It has 11 mandatory fields and supports many optional ones. For better performance,
storage efficiency and intensive data processing, the BAM file, a binary
representation of SAM, was implemented. BAM files are compressed in the BGZF format
and hold the same information as SAM while requiring less disk space. SAM can be
indexed and processed by specific tools. While Figure 8 shows an example of a SAM
file, a very detailed description of the SAM and BAM files is presented in [58].

Figure 7 FASTQ file. The 1st line always starts with the symbol '@' followed by the
sequence identifier. The 2nd line contains the sequence. The 3rd line starts with the
symbol '+', which is optionally followed by the same sequence identifier and any
description. It indicates the end of the sequence and the beginning of the quality
score string. The 4th line contains the quality score (QS) in ASCII format. The
current example shows an Illumina representation.

@HWI-EAS3X_10102_2_120_19829_1823#0/2
TCTAACTCTTACTTAGCATAGCTGTTAAAATTTTTGAGTT
+
DEAEE:B:BE5EEEED=:DEA:-AE5DDBDFFEDEEDFAE
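
The 11 mandatory SAM fields are tab-separated and always appear in the same order,
so a single alignment line can be read with a few lines of code. The sketch below is
a minimal illustration with hypothetical function names; it does not handle header
lines or BAM, and production code would normally rely on a dedicated library. The
field names and flag bits follow the SAM specification, and the example line is
modelled on the r001 read pair of the kind shown in Figure 8.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Parse one alignment line into a dict of the 11 mandatory fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in INTEGER_FIELDS:
        record[key] = int(record[key])
    record["OPTIONAL"] = parts[11:]  # optional TAG:TYPE:VALUE fields, unparsed
    return record


def is_paired(flag):
    return bool(flag & 0x1)


def is_unmapped(flag):
    return bool(flag & 0x4)


def is_reverse_strand(flag):
    return bool(flag & 0x10)


def reference_span(cigar):
    """Number of reference bases covered by an alignment, from its CIGAR string."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Only M, D, N, = and X consume positions on the reference.
    return sum(int(count) for count, op in operations if op in "MDN=X")


line = "\t".join(["r001", "99", "ref", "7", "30", "8M2I4M1D3M",
                  "=", "37", "39", "TTAGATAAAGGATACTG", "*"])
record = parse_sam_line(line)
print(record["QNAME"], record["POS"], reference_span(record["CIGAR"]))
```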

Variant call format (VCF) 

This file type was initially introduced by the 1000 Genomes Project to store the most
prevalent types of sequence variation, such as SNPs and small indels
(insertions/deletions), enriched by annotations. VCFtools [50] is equipped with
numerous functionalities to process VCF files, including validation, merging and
comparison. An example of a VCF file is shown in Figure 9.
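
A VCF data line starts with eight fixed, tab-separated columns (CHROM, POS, ID, REF,
ALT, QUAL, FILTER, INFO). The following minimal sketch, with hypothetical function
names, reads those columns and labels each alternative allele by comparing its length
with the reference allele. It ignores header lines, genotype columns and symbolic
alleles, which VCFtools and similar software handle properly; the example line is
invented for illustration.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a VCF data line; None for header lines."""
    if line.startswith("#"):
        return None
    chrom, pos, variant_id, ref, alt, qual, filt, info = \
        line.rstrip("\n").split("\t")[:8]
    info_fields = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, has_value, value = entry.partition("=")
        info_fields[key] = value if has_value else True  # bare keys are flags
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),  # several alternative alleles are allowed
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": info_fields,
    }


def variant_type(ref, alt):
    """Label a REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "multi-nucleotide substitution"


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"
record = parse_vcf_line(line)
print(record["CHROM"], record["POS"],
      [variant_type(record["REF"], alt) for alt in record["ALT"]])
```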



Variant calling pipelines 

Variant discovery still remains a major challenge for sequencing experiments.
Bioinformatics approaches that aim to detect variations across different human
genomes have identified 3-5 million variations for each individual compared to the
reference. It is noticeable that most of the current comparative sequencing-based
studies mainly target the exome and not the whole genome, initially due to the lower
cost. It is believed that variations in the exome have a higher chance of having a
functional impact in human diseases [59]. However, recent studies show that
non-coding regions contain equally important disease-related information [60].
Sophisticated tools that can cope with the large data size, efficiently analyze a
whole genome or an exome and accurately detect genomic variations such as deletions,
insertions, inversions or inter/intra-chromosomal translocations are currently
necessary. Today, only a few such tools exist; they are summarized in Table 1. Many
of the tools are error sensitive, as errors in base calling may lead to the
identification of non-existent variants or to missing true variants in the sample,
something that still remains a bottleneck in the field.



Variant annotation 

As genetic diseases can be caused by a variety of different possible mutations in DNA
sequences, the detection of genetic variations that are associated with a specific
disease of interest is very important. Even though most of the variations detected by
variant callers are found to be functionally neutral [74] and do not contribute to
the phenotype of a disease [75], many of them have led to important findings. In
order to better identify and characterize the causative variations for genetic
disorders, the implementation of efficient variant annotation tools has become
necessary and is one of the most challenging aspects of the field. Table 2 summarizes
the available software which serves this purpose by highlighting the strengths and
the weaknesses of each application.

Figure 8 BAM/SAM files. Example of an alignment to the reference sequence (pileup).
A) Reads r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004
represents a split alignment. B) The corresponding SAM file and the tags for each
field.

Figure 9 VCF file. This figure demonstrates an example of a VCF file. A) Different
types of variations and polymorphisms that can be stored in VCF format. B) Example of
a VCF file and its fields.



Visualization of structural variation 

Visualization of high-throughput data to provide meaningful views and make pattern
extraction easier still remains a bottleneck in systems biology. More than ever, such
applications represent a precious tool for biologists, allowing them to directly
visualize large-scale data generated by sequencing. The vast amounts of data produced
by deep sequencing can be impossible to analyze and visualize due to high storage,
memory and screen size requirements. Therefore, biological data visualization is an
ever-expanding field that is required to address new tasks in order to cope with the
increasing complexity of information. While a recent review [87] discusses the
perspectives and the challenges of visualization approaches in sequencing, the tables
below emphasize the strengths and the weaknesses of the available tools.



Alignment tools 

Aligning long sequences is not a trivial task. Therefore, efficient tools that are
able to handle this load of data and provide intuitive layouts using linear or
alternative (e.g. circular) representations are important. Table 3 shows a list of
widely used applications and provides an overview of the strengths and weaknesses of
each tool.



Table 1 Software for predicting structural variations 



Tool 



Single-End Pair-End Reference Insertion Deletion Inversion Translocation across Translocation within Properties 
genome chromosomes chromosome 



Input File 



BreakDancer [61] 



CNV-seq [62] 



GASV [63] 



HyDRa [64] 



MoDIL [65] 



MrFast [66] 
NovelSeq [67] 

PEMer [68] 



X X 



X X 



X X 



X X 
X 



X X 



• BreakDancerMax for large
regions and BreakDancerMini
for indels of 10-100 bp

• Shotgun sequencing 

• Robust statistical model 

• Geometric approach 

• A SV is pictured as a 
polygon on a surface 

• Comparison of SVs across 
multiple samples 

• SV breakpoints by 
clustering discordant 
paired-end alignments 

• Medium sized (10-50 bp) 
paired-end indels 

• Able to identify shorter
heterozygous, as well as
homozygous variants
with higher accuracy

• Short sequence reads 
(>25 bp) 

• Long novel sequence 
insertions 

• Multiple types of variations 

• PEMer: variations 

• SV-Simulation: simulated 
paired-end reads 

• BreakDB: annotations 



BAM, SAM 



Map locations from a BAM
file (by SAMtools)

BAM 



Tab-delimited discordant
paired-end mappings

Software specific 



FASTA, FASTQ 
Software specific 

SVdB API 



Table 1 Software for predicting structural variations (Continued) 

Pindel [69] X X X X 

rSW-seq [70] X XXX 

VariationHunter [71] X XX 

VarScan [72,73] X XXX 



• Large deletions (1 bp-10 kb) BAM, SAM, FASTA, FASTQ



• Medium sized insertions
(1-20 bp) from 36 bp
paired-end short reads

• Based on an iterative
Smith-Waterman dynamic
sequence alignment
method

• Evaluation of the entire 
possible mapping set of 
positions of each paired-end 
read and final mapping of 
the SVs interdependently. 

• Germline variants (SNPs and Pileup, VCF 
indels) in individual samples 

or pools of samples. 

• Shared and private variants 
in multi-sample datasets 
(with mpileup). 

• Somatic mutations, LOH 
events, and germline variants 
in tumor-normal pairs. 

• Somatic copy number 
alterations (CNAs) in 
tumor-normal exome 
data. 



Tab-delimited file denoting 
the tumor/normal status 
for each of aligned read 
positions 

Software specific 
Table 2 Variant annotators

Tool              Annotation                           Data support
Annotate-it [76]  SNPs, miRNA, Gene, Custom            OMIM, dbSNP, 200 Danish genomes, NHLBI Exomes, 1000 Genomes
KGGSeq [77]       Indels, SNPs, Gene                   dbSNP, 1000 Genomes
ANNOVAR [78]      Indels, SNPs, miRNAs, Gene, Custom   dbSNP, NHLBI Exomes, 1000 Genomes
Anntools [79]     Indels, SNPs, miRNAs, Gene, Custom   dbSNP, 1000 Genomes
SeqAnt [80]       Indels, SNPs, Gene                   dbSNP, 1000 Genomes
SVA [81]          Indels, SNPs, Gene, Custom           OMIM, dbSNP, 1000 Genomes
TREAT [82]        Indels, SNPs, Gene                   OMIM, dbSNP, 1000 Genomes
VAAST [83]        Indels, SNPs
VarioWatch [84]   SNPs, Gene                           OMIM, dbSNP, 1000 Genomes
Var-MD [85]       SNPs
VarSifter [86]    Indels, SNPs

Genome browsers 

Genome browsers are mainly developed to display sequencing data and genome
annotations from various data sources in one common graphical interface. Initially,
genome browsers were developed to display assemblies of smaller genomes of specific
organisms, but with the latest rapid technological innovations and sequencing
improvements, it is essential today to be able to navigate through sequences of huge
length and simultaneously browse the genomic annotations and other known sources of
information available for these sequences. While recent studies [94-96] review the
overlaps and comment on the future of genome browsers, we focus on the most widely
used ones and comment on their usability and their strengths, as shown in Table 4.

Visualization for comparative genomics 

Comparative genomics is expected to be one of the main challenges of the next decade
in bioinformatics research, mainly due to sequencing innovations that currently allow
sequencing of whole genomes at a lower cost and within a reasonable timeframe.
Microbial studies, evolutionary studies and medical approaches already take advantage
of such methods to compare sequences of patients against controls, to compare newly
discovered species with other closely related species and to identify the presence of
specific species in a population. Therefore, a great deal of effort has been made to
develop algorithms that are able to cope with multiple, pairwise and local alignments
of complete genomes. Alignment of unfinished genomes, intra/inter chromosome
rearrangements and identification of functional elements are some important tasks that
are amenable to analysis by comparative genomics approaches. Visualization of such
information is essential to obtain new knowledge and reveal patterns that can only be
perceived by the human eye. In this section we present a list of recently developed
software applications that aim to address all of the aforementioned tasks and we
emphasize their main functionality, their strengths and their weaknesses (see Table 5).

Discussion 

Advances in high throughput next generation sequencing techniques allow the
production of vast amounts of data in different formats that currently cannot be analyzed in a 



Table 3 Alignment tools 



Tool 



Purpose 



Properties 



Support 



Availability 



ABySS Explorer [88] • Global sequence assemblies from smaller 
fragments of DNA 

CLC Genomics • Analysis of de novo assembly 

workbench 



EagleView [89] • Large genome assemblies 



Hawkeye [90] • Detection of anomalies in data and visually 

identify and correct assembly errors 

LookSeq [91] • Visualization of sequences derived from multiple 

sequencing technologies 



MagicViewer [92] • Assembly visualization and genetic variation 
annotation tool mainly developed to easily 
visualize short read mapping 

MapView [93] • Alignments of huge-scale single-end and 

pair-end short reads 



• de-Bruijn directed graphs 



• SNP detection techniques 

• genomic rearrangements structural 
variations 

• Multiple-line scheme 



• Consensus validation of potential genes, 
dynamic filtering and automated 
clustering 

• Browsing at different resolutions 

• Read-depth coverage 

• Putative single nucleotide and SV 

• Identification and annotation of genetic 
variation based on the reference genome 

• Multiple navigation 

• Zooming modes 

• Multi-thread processing 



• DOT files [63] 



• Sanger, 454, Illumina and SOLiD 



• Java stand-alone application 



• Commercial stand-alone application 



• Navigation by genomic location, read 
identifiers, annotations, descriptions, 
user-defined coordinate map 

• Free stand-alone application 



• Compatibility with Phrap, ARACHNE 
[34], Celera Assembler [32] and others 

• SAM/BAM files 



• Multiple color schemes 

• Zoomable interface 

• MapView formatted (MVP) files 



• Free stand-alone application 



• Web application 



• Pipeline to detect, filter, annotate, 
visualize or classify genetic variations 
by function 

• Free stand-alone application 



• Variation analysis 



Table 4 Genome browsers 



Tool 



Purpose 



Properties 



Support 



Availability 



AnnoJ [97] 



• Deep sequencing and other genome 
annotation data 



Argo 



CGView [98] 



Combo [99] 



Ensembl [100,101] 



GBrowse [102,103] 



Genome Projector [105] 



IGB [106] 



• Manual annotations of complete 
genomes 

• Static and interactive graphical maps 
of circular genomes using a circular 
layout 



• Dynamic browser to visualize alignments 
of whole genomes and their associated 
annotations 

• Annotation, analysis and display of 
various genomes 

• Combination of databases and interactive 
web pages to manipulate and display 
genome annotations 



• Circular genome maps, traditional genome 
maps, plasmid maps, biochemical pathways 
maps and DNA walks 



• Optimized to achieve maximum flexibility 
and high quality genome visualization 



• Implemented by users, to handle data and 
render it into a visible form. 

• Plugin architecture 

• Smooth navigation 



• ComBo comparative viewer to view dot 
plots of multiple aligned sequences 

• Export of graphical maps in PNG, JPG or 
SVG formats 

• Generation of a series of hyperlinked maps 
showing expanded views 

• Use of a dot plot view 

• Highlighted views of detailed information 
from specific alignments and annotations 

• Optimized to serve thousands of users per 
day and to handle large amounts of data 



• GBrowse_syn is an extension to show 
dot-plots for comparative genomics 



• Limited to bacterial species with circular 
chromosomes 



• Visualization of tiling array data, NGS 
results, genome annotations, microarray 
designs and the sequence itself 



• Web 2.0 application implemented in 
JavaScript 



• Distribution of work between the server 
and the client with distant access through 
web services 

• FASTA, GenBank, GFF, BLAST, BED, Wiggle 
and Genscan files 



• Series of hyperlinked maps showing 
expanded views 

• XML formats 



• Zoom in and out at various resolutions 

• Its own file format 

• API for accessing and associating 
genome-scale data from different 
species across the taxonomy 

• "rubber band" interface to allow faster 
zooming 



• Google Maps API to offer smoother 
navigation and better searching 
functionality 

• It comes with its own API 

• Rapid navigation through multiple 
zooming scales and across large regions 
of genomic sequence 



• Web 2.0 JavaScript application 



• Stand-alone Java application 
that can be launched as an 
applet or via Java Web Start 

• Implemented in Java and it 
comes with its own API 



• Stand-alone Java application 



• Web application 



• Component of the Generic 
Model Organism System 
Database Project (GMOD) [104] 

• HTML/JavaScript 

• Web application 



• Stand-alone Java application 



Table 4 Genome browsers (Continued) 



IGV [107] 



UCSC Cancer 
Genomics 
Browser [108] 

UCSC Genome 
Browser [109] 



• High-performance and ability to interactively 
explore and integrate large datasets 



• Integration of clinical data 



• Rapid linear visualization, examination, and 
querying of the data at many levels; it 
currently accommodates genomes of ~50 
species 



X:map [110] 



• Mappings between genomic features and 
Affymetrix microarrays 



• Sequence alignments, microarrays, 
and genomic annotations 



• Heatmaps 

• Boxplots 

• Proportions 

• Gene Sorter, expression, homology and 
other information among related groups of 
genes. 

• Blat: mapping any sequence to the 
genome while the Table Browser provides 
direct access to the underlying database. 

• VisiGene: browsing through a large 
collection of in situ mouse and frog 
images to examine expression patterns 



• Location of individual exon probes with 
respect to their target genes, transcripts and 
exons. 



• Ability to handle huge datasets and 
diverse data sources and formats 

• Great variety of input file formats 

• Integration of meta-data as heatmaps 
for deeper analysis 

• Searching capabilities to find patterns 
in the huge amounts of clinical and 
genomic data that are 

gathered in large-scale cancer studies 

• Annotation datasets: mRNA alignments, 
mappings of DNA repeat elements, gene 
predictions, 

gene-expression data, disease-association 
data 



• Panning, zooming, and dragging 
capabilities increase the quality of 
interaction 

• Uploading a large variety of files 

• User specific customized sessions. 

• Google Maps API to analyse and further 
visualize data through an associated 
BioConductor package 



• Standalone application 



• Web application 



• Genome Graphs for 
uploading and displaying 
genome-wide data sets 



• Web application 



Table 5 Comparative genomics 

Tool Purpose 



Properties 



Support 



Cinteny [111] 



ggbio [112] 



GenomeComp [113] 



Fast identification of syntenic regions 



Views of particular genomic regions and 
genome-wide overviews 



A tool for summarizing, parsing and visualizing 
a genome wide sequence comparison 



• Flexible parameterization 

• User-provided data such as orthologous 
genes, sequence tags or other markers 

• ideograms 

• grand linear views 

• sequence fragment length 

• edge-linked interval to data view, 

• mismatch pileup, 

• several splicing summaries 

• A tool to locate the rearrangements, 
insertions or deletions of genome 
segments between species or strains 



• Pre-loaded annotated mammalian, 
invertebrate and fungal genomes 



• Bioconductor Library 



• Fasta format 

• Genbank format 

• EMBL format 



Circos [114] 



DHPC [115] 



HilbertVis [116] 



Developed to identify and analyze similarities 
and differences between larger genomes 



Visualization of large-scale genome sequences 
by mapping sequences into a two-dimensional 
space using the space-filling function of 
Hilbert-Peano mapping. 



Functions to visualize long vectors of integer 
data by means of Hilbert curves 



• Circular layout 

• Scatter, line, and histogram plots, heat 
maps, tiles, connectors, and text 

• Repeating sequences 

• Degree of base bias 

• Regions of homogeneity and their 
boundaries, 

• Mark of annotated segments such as 
genes or isochores. 

• ChIP-Seq data 

• ChIP-chip data 



BLAST output file 

• It supports its own file format 



• DNA sequences can be loaded in 
plain text or FASTA format 



• The stand-alone version can load 
GFF, BED/Wiggle and Maq map files. 



Table 5 Comparative genomics (Continued) 



inGAP-sv [117] 



Detection and visualization of structural variation 
from paired-end mapping data and detection 
of larger insertions and complex variants with 
lower false discovery rate 



Meander [118] 



It is mainly developed to visually discover and 
explore structural variations in a genome based 
on read-depth and paired-end information 



MEDEA [119] 



Genomic feature densities and genome alignments 
of circular genomes 



MizBee [120] 
Seevolution [121] 



Synteny browser for exploring conservation 
relationships in comparative genomics data 



Interactive 3D environment that enables visualization 
of diverse genome evolution processes 



• Exploration at different zoom levels 
of detail 



• Identification of different types of 
SVs, including large indels, inversions, 
translocations, tandem duplications 
and segmental duplications. 

• Distinction between homozygous 
and heterozygous variants 

• Linear view 

• Hilbert curve-based view 

• Comparison between up to four samples 
against a reference simultaneously 

• Visualization of various types of structural 
inter/intra chromosomal variations 

• Exploration of data at different resolution 
levels 

• Customization, since tracks can be 
rearranged by dragging and dropping them 
into a desired position 

• User-defined color schemes 

• Zooming into specific regions and 
smooth navigation 

• Side-by-side linked views and data 
visualization at different scales, from 
the genome to the gene 



• The R packages HilbertVis and 
HilbertVisGUI are integrated in the 

R / Bioconductor statistical environment 
and can display any data vector 
prepared with R. 

• A FASTA formatted reference sequence 
and a SAM alignment are required 

• A P^ formatted annotation file for 
the reference sequence is optional. 



• It supports its own file format both for 
RD and paired-end data 



• It supports its own file format 



• Edge bundling and layering to increase 
visual signals about conservation 
relationships related to closeness, 
size, relationship, and orientation. 



• Interactive animation of mutation histories 
involving genome rearrangement, point 



Table 5 Comparative genomics (Continued) 



Sybil [122] Comparative genome data, with a particular 

importance on protein and gene clustered 
data 



VISTA [123] Global DNA sequence alignments of arbitrary 

length 



mutation, recombination, insertion and 
deletion. 

Simultaneous visualization of multiple 
organisms related by a phylogeny. 



• Accepts complete phylogenetic trees 
and allows path tracing between any 
two points. 



3D models of circular and linear chromosomes 

• Graphical demonstration of local alignment 
of the genomes in which the clustered 
genes are located 

• Global and alignment visualization up to 
several megabases under the same scale 



• Genomes are organized in a vertical heap, 
as in multiple alignments and shaded areas 
links are used to connect genes that belong 
to the same cluster 

• Dynamic and interactive dot-plots 



Pavlopoulos et al. BioData Mining 2013, 6:13 
http://www.biodatamining.org/content/6/1/13 






non-automated way. Visualization approaches are today called upon to handle huge
amounts of data, efficiently analyze them and deliver the knowledge to the user in a
visual way that is concise, coherent and easy to comprehend and interpret. User
friendliness, pattern recognition and knowledge extraction are the main goals that an
optimal visualization tool should achieve. Therefore, tasks like handling the overload
of information, displaying data at different resolutions, fast searching or smoother
scaling and navigation are not trivial when the information to be visualized consists
of millions of elements and often reaches an enormous level of complexity. Modern
libraries, able to visually scale millions of data points at different resolutions, are
nowadays essential.
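
The idea of displaying data at different resolutions can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function names
(downsample, build_resolution_levels, pick_level), the mean-based summary and the
default factor of 2 are hypothetical choices and are not taken from any of the reviewed
tools. The sketch precomputes progressively coarser summaries of a per-base signal
(for example read-depth coverage) so that a viewer can draw the level that matches the
current zoom instead of millions of raw points.

```python
from typing import List


def downsample(values: List[float], factor: int) -> List[float]:
    """Average consecutive windows of `factor` values (the last window may be shorter)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out = []
    for start in range(0, len(values), factor):
        window = values[start:start + factor]
        out.append(sum(window) / len(window))
    return out


def build_resolution_levels(signal: List[float], factor: int = 2) -> List[List[float]]:
    """Return the signal at full resolution followed by coarser and coarser summaries.

    Each level is computed from the previous one, so values are means of means;
    this is an approximation when the length is not a power of `factor`.
    """
    if factor < 2:
        raise ValueError("factor must be >= 2 so that every level gets shorter")
    levels = [list(signal)]
    while len(levels[-1]) > 1:
        levels.append(downsample(levels[-1], factor))
    return levels


def pick_level(levels: List[List[float]], max_points: int) -> List[float]:
    """Choose the finest level that has at most `max_points` elements."""
    for level in levels:
        if len(level) <= max_points:
            return level
    return levels[-1]
```

With such a pyramid, a display that is 1,000 pixels wide would call
pick_level(levels, 1000) and never has to touch the full-resolution vector while the
user is zoomed out.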

Current tools lack dynamic data structures and dynamic indexing support, which would
improve processing performance. Multi-threaded programming or parallel processing
support would also be a natural way to reduce the processing time when applications
run on multicore machines with many CPUs. An efficient architecture that decentralizes
data and distributes the work between the servers and the clients is also a step
towards the reduction of processing time.
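
As a rough illustration of the parallel processing idea, the sketch below splits a
genome into fixed-size regions and counts reads per region through an executor from
Python's standard concurrent.futures module. The region layout, the function names and
the in-memory dictionary of read start positions are assumptions made for the example;
a real tool would read from an indexed alignment file instead.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Dict, List, Tuple

# (chromosome, start, end) with a half-open interval; a hypothetical layout
Region = Tuple[str, int, int]


def split_into_regions(chrom_lengths: Dict[str, int], chunk: int) -> List[Region]:
    """Cut every chromosome into consecutive regions of at most `chunk` bases."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, chunk):
            regions.append((chrom, start, min(start + chunk, length)))
    return regions


def count_reads(region: Region, read_starts: Dict[str, List[int]]) -> int:
    """Count reads whose start position falls inside the region."""
    chrom, start, end = region
    return sum(1 for pos in read_starts.get(chrom, []) if start <= pos < end)


def parallel_counts(
    regions: List[Region],
    read_starts: Dict[str, List[int]],
    executor: Executor,
) -> Dict[Region, int]:
    """Submit one task per region and collect the results."""
    futures = {r: executor.submit(count_reads, r, read_starts) for r in regions}
    return {r: f.result() for r, f in futures.items()}


if __name__ == "__main__":
    lengths = {"chr1": 250, "chr2": 100}        # toy chromosome lengths
    reads = {"chr1": [5, 99, 100, 240], "chr2": [0, 50, 99]}
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(parallel_counts(split_into_regions(lengths, 100), reads, pool))
```

Because the executor is passed in, the same code can be run with a
ProcessPoolExecutor, which is usually the better fit for CPU-bound work in Python,
whereas the thread pool shown here mainly helps when the per-region task waits on
disk or network access.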

While knowledge is currently stored in various databases, distributed across the
world and analyzed by various workflows, integration among the available tools is
becoming a necessity. Next-generation visualization tools should be able to extract,
combine and analyze knowledge and deliver it in a meaningful and sensible way. For
this to happen, international standards should be defined to describe how next
generation sequencing techniques should store their results and exchange them through
the web. Unfortunately, many visualization and analysis approaches are today being
developed independently. Many of the new methods come with their own convenient file
format to store and present the information, something that will become a problem in
the future when hundreds of methods become available. Such cases are widely
discussed in biological network analysis approaches [124,125].

Visual analytics will in the future play an important role in allowing users to
parameterize various workflows visually. So far it may be confusing and misleading to
the user when software packages produce significantly different results after only a
slight change in the value of a single parameter. Furthermore, different approaches
can come up with completely different results despite the fact that they try to answer
the same question. This can be attributed to the fact that they follow completely
different methodologies, which highlights the need for enforcing a more general output
format. Future visualization tools should offer the flexibility to easily integrate and
fine-tune parameters in such a way that end users can readily adjust the analysis to
their needs.
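
The sensitivity to a single parameter can be made concrete with a toy example. The
sketch below is not the method of any reviewed package: the call_variants function,
the allele-fraction threshold and the three made-up sites are assumptions chosen only
to show how a small change in one cutoff alters the reported call set.

```python
from typing import Dict, List, Tuple

# (position, reads supporting the alternative allele, total reads); toy values
Site = Tuple[int, int, int]


def call_variants(sites: List[Site], min_fraction: float, min_depth: int = 1) -> List[int]:
    """Report positions whose alternative-allele fraction reaches `min_fraction`."""
    called = []
    for pos, alt, depth in sites:
        if depth > 0 and depth >= min_depth and alt / depth >= min_fraction:
            called.append(pos)
    return called


def sweep(sites: List[Site], thresholds: List[float]) -> Dict[float, List[int]]:
    """Run the caller once per threshold so the call sets can be compared."""
    return {t: call_variants(sites, t) for t in thresholds}


if __name__ == "__main__":
    toy_sites = [(100, 2, 10), (200, 3, 12), (300, 9, 10)]
    for threshold, calls in sweep(toy_sites, [0.20, 0.25, 0.26]).items():
        print(threshold, calls)
```

A visual analytics front end could expose such a sweep as a slider, letting the user
see which calls appear or disappear as the threshold moves rather than rerunning the
workflow blindly.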

Finally, data integration at different levels, varying from tools to concepts, is a
necessity. Combining functions from diverse sources varying from annotations to
microarrays, RNA-Seq and ChIP-Seq data leads towards a better understanding of
the information hidden in a genome. Similarly, visual representations well established
in other scientific areas, such as economics or social studies, should be shared and
applied to the current field of sequencing.
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